sketch out improved performance by refactoring codec pipeline logic #3719

Draft
d-v-b wants to merge 32 commits into zarr-developers:main from d-v-b:perf/smarter-codecs
@d-v-b d-v-b commented Feb 25, 2026

This builds on top of #3715 and achieves even more perf improvements by refactoring the basic logic of codec encoding / decoding. The design document behind these changes is here.

I do not think this is merge-worthy as it stands, since it's far too big. But I'm going to post the performance gains and then start figuring out how to break this into pieces.

A big feature this adds is the ability to write individual subchunks to uncompressed shards on storage backends that support range writes (local and memory).
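To illustrate the idea behind partial shard writes, here is a minimal sketch in plain Python. It is not the code from this PR and does not use the zarr API. It assumes a hypothetical shard layout in which uncompressed subchunks all have the same byte length and are stored back to back, so the byte range of a subchunk can be computed from its index alone. Real shards also carry an index of offsets and lengths, which this sketch leaves out. The names `CHUNK_NBYTES`, `CHUNKS_PER_SHARD`, `write_subchunk`, and `read_subchunk` are made up for the example.

```python
import os
import tempfile

# Hypothetical layout: every uncompressed subchunk has the same byte length
# and subchunks are stored back to back inside the shard file.
CHUNK_NBYTES = 32
CHUNKS_PER_SHARD = 8


def create_empty_shard(path: str) -> None:
    """Allocate a zero-filled shard large enough for all subchunks."""
    with open(path, "wb") as f:
        f.write(bytes(CHUNK_NBYTES * CHUNKS_PER_SHARD))


def write_subchunk(path: str, chunk_index: int, payload: bytes) -> None:
    """Overwrite one subchunk in place using a ranged write."""
    if not 0 <= chunk_index < CHUNKS_PER_SHARD:
        raise IndexError(f"chunk index {chunk_index} is outside the shard")
    if len(payload) != CHUNK_NBYTES:
        raise ValueError("payload must be exactly one subchunk long")
    # "r+b" opens for update without truncating, so only the target byte
    # range changes; the rest of the shard is neither read nor rewritten.
    with open(path, "r+b") as f:
        f.seek(chunk_index * CHUNK_NBYTES)
        f.write(payload)


def read_subchunk(path: str, chunk_index: int) -> bytes:
    """Read back a single subchunk by byte range."""
    with open(path, "rb") as f:
        f.seek(chunk_index * CHUNK_NBYTES)
        return f.read(CHUNK_NBYTES)


def demo() -> tuple[int, bytes, bytes]:
    """Write subchunk 3 and return (shard size, subchunk 3, subchunk 2)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shard.bin")
        create_empty_shard(path)
        write_subchunk(path, 3, bytes(range(CHUNK_NBYTES)))
        return (
            os.path.getsize(path),
            read_subchunk(path, 3),
            read_subchunk(path, 2),
        )
```

The fixed-offset arithmetic only holds for uncompressed data, because compressed subchunks vary in length and an in-place overwrite could spill into the neighbouring subchunk. That is presumably why the feature is limited to uncompressed shards.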

Benchmark comparison: perf/smarter-codecs vs main

| Test Name | perf/smarter-codecs (ms) | main (ms) | Speedup |
| --- | ---: | ---: | ---: |
| `test_write_array[memory-chunks=100,shards=1M-None]` | 46.85 | 1018.60 | 21.74× |
| `test_write_array[local-chunks=100,shards=1M-None]` | 68.45 | 1006.27 | 14.70× |
| `test_sharded_morton_indexing[(32,32,32)]` | 22.45 | 247.49 | 11.02× |
| `test_slice_indexing[None-(0,0,0)]` | 0.03 | 0.28 | 10.05× |
| `test_sharded_morton_indexing_large[(33,33,33)]` | 254.85 | 2521.32 | 9.89× |
| `test_slice_indexing[(50,50,50)-full_slice]` | 9.12 | 89.80 | 9.85× |
| `test_sharded_morton_indexing_large[(32,32,32)]` | 226.80 | 2153.47 | 9.50× |
| `test_sharded_morton_indexing_large[(30,30,30)]` | 181.86 | 1725.29 | 9.49× |
| `test_slice_indexing[(50,50,50)-(0,0,0)]` | 0.06 | 0.60 | 9.45× |
| `test_write_array[memory-chunks=100,shards=1M-gzip]` | 211.85 | 1978.29 | 9.34× |
| `test_slice_indexing[None-(slice(None,10,None))*3]` | 0.03 | 0.28 | 9.28× |
| `test_write_array[local-chunks=100,shards=1M-gzip]` | 217.00 | 1965.37 | 9.06× |
| `test_slice_indexing[(50,50,50)-strided_4]` | 8.96 | 80.44 | 8.98× |
| `test_slice_indexing[(50,50,50)-strided_4_offset]` | 4.83 | 43.21 | 8.95× |
| `test_sharded_morton_indexing[(16,16,16)]` | 2.78 | 23.26 | 8.37× |
| `test_slice_indexing[(50,50,50)-(slice(None,10,None))*3]` | 0.07 | 0.60 | 8.36× |
| `test_slice_indexing[None-full_slice]` | 11.07 | 85.12 | 7.69× |
| `test_read_array[memory-chunks=100,shards=1M-gzip]` | 181.44 | 1372.13 | 7.56× |
| `test_read_array[memory-chunks=100,shards=1M-None]` | 80.73 | 609.53 | 7.55× |
| `test_read_array[memory-chunks=1K,no_shards-None]` | 7.21 | 53.52 | 7.43× |
| `test_read_array[local-chunks=100,shards=1M-None]` | 83.79 | 612.18 | 7.31× |
| `test_slice_indexing[None-strided_4]` | 12.19 | 84.98 | 6.97× |
| `test_read_array[memory-chunks=1K,no_shards-gzip]` | 16.79 | 115.69 | 6.89× |
| `test_write_array[memory-chunks=1K,no_shards-gzip]` | 32.54 | 219.93 | 6.76× |
| `test_read_array[local-chunks=100,shards=1M-gzip]` | 190.20 | 1277.12 | 6.71× |
| `test_read_array[local-chunks=1K,no_shards-None]` | 24.23 | 142.33 | 5.87× |
| `test_slice_indexing[None-mixed_slice]` | 0.12 | 0.71 | 5.84× |
| `test_read_array[local-chunks=1K,no_shards-gzip]` | 37.83 | 216.45 | 5.72× |
| `test_slice_indexing[None-strided_4_offset]` | 5.77 | 32.55 | 5.64× |
| `test_write_array[memory-chunks=1K,no_shards-None]` | 19.16 | 106.12 | 5.54× |
| `test_slice_indexing[(50,50,50)-mixed_slice]` | 0.23 | 1.21 | 5.29× |
| `test_slice_indexing[(50,50,50)-strided_4-get_latency]` | 15.82 | 82.14 | 5.19× |
| `test_write_array[memory-chunks=1K,shards=1K-gzip]` | 100.60 | 451.93 | 4.49× |
| `test_read_array[local-chunks=1K,shards=1K-None]` | 66.94 | 298.83 | 4.46× |
| `test_write_array[memory-chunks=1K,shards=1K-None]` | 71.06 | 315.12 | 4.43× |
| `test_read_array[memory-chunks=1K,shards=1K-None]` | 43.90 | 192.21 | 4.38× |
| `test_sharded_morton_single_chunk[(32,32,32)]` | 0.18 | 0.74 | 4.12× |
| `test_read_array[memory-chunks=1K,shards=1K-gzip]` | 68.96 | 283.15 | 4.11× |
| `test_read_array[local-chunks=1K,shards=1K-gzip]` | 90.06 | 368.79 | 4.09× |
| `test_sharded_morton_single_chunk[(33,33,33)]` | 0.19 | 0.76 | 4.03× |
| `test_slice_indexing[(50,50,50)-full_slice-get_latency]` | 20.53 | 79.62 | 3.88× |
| `test_sharded_morton_single_chunk[(30,30,30)]` | 0.20 | 0.72 | 3.69× |
| `test_slice_indexing[(50,50,50)-strided_4_offset-get_latency]` | 17.73 | 50.27 | 2.84× |
| `test_slice_indexing[None-strided_4_offset-get_latency]` | 19.29 | 48.98 | 2.54× |
| `test_slice_indexing[None-strided_4-get_latency]` | 43.55 | 102.36 | 2.35× |
| `test_slice_indexing[None-full_slice-get_latency]` | 46.77 | 101.82 | 2.18× |
| `test_write_array[local-chunks=1K,shards=1K-gzip]` | 367.61 | 733.76 | 2.00× |
| `test_slice_indexing[(50,50,50)-(0,0,0)-get_latency]` | 0.49 | 0.87 | 1.79× |
| `test_slice_indexing[(50,50,50)-(slice(None,10,None))*3-get_latency]` | 0.50 | 0.88 | 1.77× |
| `test_slice_indexing[None-mixed_slice-get_latency]` | 0.58 | 1.00 | 1.74× |
| `test_slice_indexing[(50,50,50)-mixed_slice-get_latency]` | 1.12 | 1.92 | 1.72× |
| `test_write_array[local-chunks=1K,shards=1K-None]` | 390.76 | 619.05 | 1.58× |
| `test_write_array[local-chunks=1K,no_shards-None]` | 225.86 | 340.42 | 1.51× |
| `test_write_array[local-chunks=1K,no_shards-gzip]` | 284.20 | 400.62 | 1.41× |
| `test_slice_indexing[None-(slice(None,10,None))*3-get_latency]` | 0.31 | 0.43 | 1.39× |
| `test_slice_indexing[None-(0,0,0)-get_latency]` | 0.33 | 0.43 | 1.28× |
| `test_morton_order_iter[(30,30,30)]` | 104.17 | 123.84 | 1.19× |
| `test_morton_order_iter[(16,16,16)]` | 2.56 | 3.01 | 1.18× |
| `test_morton_order_iter[(10,10,10)]` | 7.44 | 8.61 | 1.16× |
| `test_morton_order_iter[(8,8,8)]` | 0.31 | 0.36 | 1.15× |
| `test_morton_order_iter[(33,33,33)]` | 644.86 | 740.97 | 1.15× |
| `test_morton_order_iter[(20,20,20)]` | 78.57 | 88.98 | 1.13× |
| `test_sharded_morton_write_single_chunk[(33,33,33)]` | 668.32 | 754.39 | 1.13× |
| `test_morton_order_iter[(32,32,32)]` | 22.80 | 24.85 | 1.09× |
| `test_sharded_morton_write_single_chunk[(30,30,30)]` | 130.08 | 129.83 | 1.00× |
| `test_sharded_morton_write_single_chunk[(32,32,32)]` | 48.33 | 32.43 | 0.67× |

d-v-b added 30 commits February 18, 2026 21:48